No unapproved extension of the deadline is allowed. Late submission will lead to 0 credit.
Discussion is encouraged on Ed as part of the Q/A. However, all assignments should be done individually.
This assignment consists of both programming and theory questions.
Unless a theory question explicitly states that no work is required to be shown, you must provide an explanation, justification, or calculation for your answer.
To switch between cell for code and for markdown, see the menu -> Cell -> Cell Type
You can directly type Latex equations into markdown cells.
If a question requires a picture, you could use this syntax <img src="" style="width: 300px;"/> to include them within your ipython notebook.
Your write up must be submitted in PDF form. You may use either Latex, markdown, or any word processing software. We will **NOT** accept handwritten work. Make sure that your work is formatted correctly, for example submit $\sum_{i=0} x_i$ instead of \text{sum_{i=0} x_i}
pairwise_dist [5 pts] - programming
KMeans Implementation [35pts] - programming
Silhouette Coefficient [10 pts]
DBScan [10 pts] - programming BONUS FOR UNDERGRAD
2.1 Performing EM Update [15 pts] - non-programming
2.1.1 [3pts] - non-programming
2.1.2 [3pts] - non-programming
2.1.3 [9pts] - non-programming
3.1 Helper Functions [15pts] - programming & non-programming
3.1.1. softmax [5pts]
3.1.2. logsumexp [3pts + 2pts] - programming & non-programming
3.1.3. normalPDF [5pts] - for CS4641 students only
3.1.3. multinormalPDF [5pts] - for CS7641 students only
3.2 GMM Implementation [30pts] - programming
3.2.1. init_components [5pts]
3.2.2._ll_joint [10pts]
3.2.3. Setup iterative steps for EM algorithm [15pts]
3.3 Image Compression and Pixel clustering [10pts] - non-programming
4.1: KNN [12pts] - programming
4.1.a. complete, incomplete, unlabeled_ [3pts]
4.1.b. CleanData __call__ [7pts]
4.1.c. MeanCleanData [2pts]
4.2: Getting acquainted with semi-supervised learning approaches [5pts] - non-programming
4.3: Implementing the EM algorithm [10pts] - programming
_init_components [5pts]
SemiSupervised __call__ [5pts]
4.4: Demonstrating the performance of the algorithm [5pts] - programming
accuracy_semi_supervised [2.5pts]
accuracy_GNB [2.5pts]
4.5: Interpretation of Results [2pts] - non-programming
Note: It is highly recommended that you do Q4 (if not for the HW then before the project) as it teaches you imperfect data handling and a good understanding of how the models you have learnt can be used together for better results.
This notebook is tested under python 3.**.**, and the corresponding packages can be downloaded from miniconda. You may also want to get yourself familiar with several packages:
You can create a python conda environment with the necessary packages using the instructions in the environment/environment_setup.md file.
Please implement the functions that have "raise NotImplementedError", and after you finish the coding, please delete or comment "raise NotImplementedError".
###############################
### DO NOT CHANGE THIS CELL ###
###############################
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division
%matplotlib inline
import sys
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import localtests
from mpl_toolkits.mplot3d import axes3d
from tqdm import tqdm
print('Version information')
print('python: {}'.format(sys.version))
print('matplotlib: {}'.format(matplotlib.__version__))
print('numpy: {}'.format(np.__version__))
# Load image
import imageio
%load_ext autoreload
%autoreload 2
Version information python: 3.9.13 (main, Aug 25 2022, 23:51:50) [MSC v.1916 64 bit (AMD64)] matplotlib: 3.5.2 numpy: 1.23.1 The autoreload extension is already loaded. To reload it, use: %reload_ext autoreload
Rob Boss, a perpetual Ph.D student at Georgia Tech University, is teaching his dissertation robot, Pablo, how to paint in preparation for the Clough Art Crawl. Unfortunately, Rob is on a shoestring PhD budget and can only afford to buy Pablo a limited variety of colors to practice with. Luckily Rob has taken Machine Learning and remembers that KMeans clustering can be used to compress images down to a few colors. Here you will help Rob realize Pablo's dreams of electric sheep by implementing KMeans.
(Artwork generated by Midjourney, now in open beta!)

KMeans is trying to solve the following optimization problem:
\begin{align} \arg \min_S \sum_{i=1}^K \sum_{x_j \in S_i} ||x_j - \mu_i||^2 \end{align}where one needs to partition the N observations into K clusters: $S = \{S_1, S_2, \ldots, S_K\}$ and each cluster has $\mu_i$ as its center.
In this section, you are asked to implement pairwise_dist function.
Given $X \in \mathbb{R}^{N \times D}$ and $Y \in \mathbb{R}^{M \times D}$, obtain the pairwise distance matrix $dist \in \mathbb{R}^{N \times M}$ using the euclidean distance metric, where $dist_{i, j} = ||X_i - Y_j||_2$.
DO NOT USE FOR LOOPS in your implementation, using for-loops or while-loops will result in 0 credit for this portion. Use array broadcasting instead.
We have provided some unit tests in localtests.py for you to check your implementation. See Using the Local Tests for more details.
localtests.KMeansTests().test_pairwise_dist()
localtests.KMeansTests().test_pairwise_speed()
UnitTest passed successfully! UnitTest passed successfully!
In this section, you are asked to implement several methods in kmeans.py
The Kmeans algorithm is sensitive to how the centers are initialized. The naive approach is to randomly initialize the centers. However, a bad initialization can increase the time required for convergence or may even converge to a non-optimal solution.
The algorithm for KMPP that you will implement can be described as follows:
After you've chosen your centers, you will need to update the membership of each point based on the closest center. You will implement this in _update_assignment. See docstring for more details.
Since cluster memberships may have changed, you will need to update the cluster centers. You will implement this in _update_centers. See docstring for more details.
We will consider KMeans to be converged when the change in loss drops below a threshold value. The loss will be defined as the sum of the squared distances between each point and it's respective center. The iteration is implemented for you in the __call__ method.
We have provided the following local tests to help you check your implementation. Provided unit-tests are meant as a guide and are not intended to be comprehensive. See Using the Local Tests for more details.
localtests.KMeansTests().test_init()
localtests.KMeansTests().test_update_centers()
localtests.KMeansTests().test_kmeans_loss()
UnitTest passed successfully! UnitTest passed successfully!
We will now take a look at how image quality is impacted by the number of clusters
###############################
### DO NOT CHANGE THIS CELL ###
###############################
#Note that because of a different file structure, students' paths will be different
from utilities import *
image_values = image_to_matrix('./data/images/electric_sheep1.png')
r = image_values.shape[0]
c = image_values.shape[1]
ch = image_values.shape[2]
# flatten the image_values
image_values = image_values.reshape(r*c,ch)
print('Loading...')
image_2 = update_image_values(2, image_values, r, c, ch).reshape(r, c, ch)
image_5 = update_image_values(5, image_values, r, c, ch).reshape(r, c, ch)
image_10 = update_image_values(10, image_values, r, c, ch).reshape(r, c, ch)
image_20 = update_image_values(20, image_values, r, c, ch).reshape(r, c, ch)
plot_image([image_2, image_5, image_10, image_20], ['K = 2', 'K = 5', 'K = 10', 'K = 20'])
Loading...
To obtain these 5 points, you need to be pass the tests set up in the autograder. These will test the centers created by your implementation. Be sure to upload the correct files to obtain these points.
But how many clusters is the right number of clusters? In this section, you will implement the silhouette_coefficient function in kmeans.py to help you evaluate the performance of the clusters.
As presented in class, we define the Silhouette Coefficient for a clustering C is \begin{align*} SC = \frac{1}{n}\sum_{i=1}^n{s_i} \end{align*} With individual Silhouette Coefficients at a point $x_i$ defined as \begin{align*} s_i = \frac{\mu_{out}^{min}(x_i) - \mu_{in}(x_i)}{\max\{\mu_{out}^{min}(x_i),\mu_{in}(x_i)\}} \end{align*} where we defined $\mu_{out}^{min}(x_i)$ and $\mu_{in}(x_i)$ as follows: \begin{align*} \mu_{out}^{min}(x_i) = \min_{j \neq \hat{y}_i} \bigg\{\frac{\sum_{y \in C_j}\delta(x_i, y)}{n_j}\bigg\} \\ \mu_{in}(x_i) = \frac{\sum_{x_j \in C_{\hat{y}_i, j \neq i}}\delta(x_i, x_j)}{n_{\hat{y}} - 1} \end{align*}
Due to the high computational cost of calculating the Silhouette Coefficient, we reduce the image resolution before applying kmeans and calculating the coefficient value. We have done this for you in the cell below, so you will only be responsible for implementing the silhouette_coefficient method in kmeans.py.
NOTE: Based on implementation, the coefficient floating points may vary slightly. If you observe your coefficients are close to those in the local tests i.e. the difference is less than $10e^{-6}$, your solution will likely pass the Gradescope tests.
HINT: There are multiple ways to approach this problem, but you are provided with centers_mapping, which maps the index of a cluster to a numpy array of points in that cluster. These numpy arrays are conveniently constructed such that they can be directly used as an argument in pairwise_distance.
###############################
### DO NOT CHANGE THIS CELL ###
###############################
from kmeans import KMeans, silhouette_coefficient
def calculate_coefficient(k, pixels):
cluster_idx, centers, loss, centers_mapping = KMeans()(pixels, k, center_mapping=True)
coefficient, _, _ = silhouette_coefficient(pixels, cluster_idx, centers,centers_mapping)
return coefficient
# reduce the resolution of the image and find the solhouette_coefficient value
image_values = image_to_matrix('./data/images/electric_sheep1.png')
from skimage.transform import resize
res = resize(image_values, (76, 111)).reshape(76* 111, 3)
print("Calculated Silhouette Coefficient (k = 3)", calculate_coefficient(3, res))
print("Expected Silhouette Coefficient (k = 3)", 0.47735408960571196)
print("Calculated Silhouette Coefficient (k = 5)", calculate_coefficient(5, res))
print("Expected Silhouette Coefficient (k = 5)", 0.5041202297356476)
print("Calculated Silhouette Coefficient (k = 10)", calculate_coefficient(10, res))
print("Expected Silhouette Coefficient (k = 10)", 0.40500990716873275)
Calculated Silhouette Coefficient (k = 3) 0.47766666828319226 Expected Silhouette Coefficient (k = 3) 0.47735408960571196 Calculated Silhouette Coefficient (k = 5) 0.5043166869250038 Expected Silhouette Coefficient (k = 5) 0.5041202297356476 Calculated Silhouette Coefficient (k = 10) 0.40194178013557763 Expected Silhouette Coefficient (k = 10) 0.40500990716873275
We have additionally provided the following unit test to help you check your implementation. See Using the Local Tests for more details. We also provide the expected $\mu_{in}$ and $\mu_{out}$ values for debugging purposes:
localtests.KMeansTests().test_silhouette_coefficient()
k = 2 Your mu_ins: [0.8534373629473045 0.8386312195042827 1.5962448822560127 1.11977494036528 1.4716126698261671 1.133624891808676 1.2039993592132445 1.7162549426019713 1.409577728165928 1.027739542096188] Expected mu_ins: [0.85343736 0.83863122 1.59624488 1.11977494 1.47161267 1.13362489 1.20399936 1.71625494 1.40957773 1.02773954] Your mu_outs: [3.0361462050266335 2.474233939836149 2.5654514221409146 2.9019039155974076 3.015717320561218 2.6330094282704515 2.918066755957745 2.238116576477043 1.72995378610742 3.1957979761190884] Expected mu_outs: [3.03614621 2.47423394 2.56545142 2.90190392 3.01571732 2.63300943 2.91806676 2.23811658 1.72995379 3.19579798] k = 3 Your mu_ins: [0.8645916478447805 0.7626347660326419 1.5962448822560127 1.11977494036528 1.4716126698261671 1.133624891808676 -- 1.7162549426019713 1.3263609386715867 0.9469279721182694] Expected mu_ins: [0.85343736 0.83863122 0.91669768 0.81462509 0.98900666 0.58123809 1.20399936 0.98900666 1.40957773 1.02773954] Your mu_outs: [0.8199745082548768 1.066620579919205 2.417366983273371 2.8524593533237943 2.894686183547889 2.6117104781285105 1.2039993592132445 2.0838363370036195 1.72995378610742 1.270174252029943] Expected mu_outs: [3.00760695 2.36916244 2.27579209 1.42492479 1.63248134 1.68601169 2.74508479 1.95867104 1.55760417 3.02267149] k = 4 Your mu_ins: [0.8645916478447805 0.7626347660326419 0.9166976781737817 0.8146250949461582 0.9890066602992638 0.581238091182984 -- 0.9890066602992638 1.3263609386715867 0.9469279721182694] Expected mu_ins: [0.45037393 0.96660432 0.91669768 0.81462509 0.98900666 0.58123809 1.36292434 0.98900666 1.26290807 0.45037393] Your mu_outs: [0.8199745082548768 1.066620579919205 2.2757920863382437 1.4249247857844018 1.6324813396684685 1.6860116924343682 1.2039993592132445 1.9586710367028737 1.557604169157225 1.270174252029943] Expected mu_outs: [0.98779184 0.71065812 2.25354636 1.42492479 1.63248134 1.68601169 1.04507438 1.95867104 1.55624738 1.22019475] mu_in error: 25.621708166558367 mu_out error: -23.08576833813163
--------------------------------------------------------------------------- AssertionError Traceback (most recent call last) c:\Users\jwald\Desktop\CS 7641 - ML\HW2\hw2_code\HW2_Fall22_Student.ipynb Cell 21 in <cell line: 1>() ----> <a href='vscode-notebook-cell:/c%3A/Users/jwald/Desktop/CS%207641%20-%20ML/HW2/hw2_code/HW2_Fall22_Student.ipynb#X26sZmlsZQ%3D%3D?line=0'>1</a> localtests.KMeansTests().test_silhouette_coefficient() File c:\Users\jwald\Desktop\CS 7641 - ML\HW2\hw2_code\localtests.py:157, in KMeansTests.test_silhouette_coefficient(self) 154 print("mu_in error:", mu_in_error) 155 print("mu_out error:", mu_out_error) --> 157 self.assertTrue(np.allclose(np.array( 158 true_silhouette_coefficients), np.array(test_silhouette_coefficients)), 159 msg="Incorrect coefficient values, check that mu_in and mu_out (especially the latter) are computed correctly") 160 print_success_message() File c:\Users\jwald\Anaconda3\envs\ml_hw2\lib\unittest\case.py:688, in TestCase.assertTrue(self, expr, msg) 686 if not expr: 687 msg = self._formatMessage(msg, "%s is not true" % safe_repr(expr)) --> 688 raise self.failureException(msg) AssertionError: False is not true : Incorrect coefficient values, check that mu_in and mu_out (especially the latter) are computed correctly
You've now done the best you can selecting the perfect starting points and the right number of clusters. However one of the limitations of K-Means Clustering is that it dependes largely on the shape of the dataset. A common example of this is trying to cluster one circle within another (concentric circles). A K-means classifier will fail to do this and will end up effectively drawing a line which crosses the circles. You can visualize this limitation in the cell below.
###############################
### DO NOT CHANGE THIS CELL ###
###############################
# visualize limitation of kmeans
from sklearn.datasets import (make_circles, make_moons)
X1, y1 = make_circles(factor=0.5, noise=0.05, n_samples=1500)
X2, y2 = make_moons(noise=0.05, n_samples=1500)
def visualise(X, C, K=None):# Visualization of clustering. You don't need to change this function
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=C,cmap='rainbow')
if K:
plt.title('Visualization of K = '+str(K), fontsize=15)
plt.show()
pass
cluster_idx1, centers1, loss1 = KMeans()(X1, 2)
visualise(X1, cluster_idx1, 2)
cluster_idx2, centers2, loss2 = KMeans()(X2, 2)
visualise(X2, cluster_idx2, 2)
Let us try to solve these limitations using another clustering algorithm: DBSCAN. As mentioned in lecture, DBSCAN tries to find dense regions in the data spce, separated by regions of lower density. DBSCAN is parameterized by two parameters (eps and minPts):
Refer to the class slides for the DBSCAN pseudocode to complete fit(), expandCluster(), and regionQuery() in dbscan.py.
HINTS:
The following unittests will help get you started, but is in no way comprehensive. You are encouraged to extend and create your own test cases. See Using the Local Tests for more details.
localtests.DBScanTests().test_region_query()
localtests.DBScanTests().test_expand_cluster()
UnitTest passed successfully! UnitTest passed successfully!
Then, test your fitting by running the cell below. You should be able to get a perfect clustering for the two circles dataset, which you can observe quantitatively by checking whether the clusters returned by cluster_idx and the ground truth clusters are the same and qualitatively by visualizing the clusters.
###############################
### DO NOT CHANGE THIS CELL ###
###############################
BEST_EPS = 0.11
BEST_POINTS = 3
from dbscan import DBSCAN
dbscan = DBSCAN(BEST_EPS, BEST_POINTS, X1)
cluster_idx = dbscan.fit()
## Note that one of the two cells should print True for a correct implementation
print(np.array_equal(y1, cluster_idx)) #Checks if y1 == cluster_idx
print(np.array_equal(y1, 1-cluster_idx)) ## Checks if y1 is the exact opposite of cluster_idx (1s instead of 0s and vice-versa)
visualise(X1, cluster_idx)
True False
SOLUTIONS CANNOT BE HANDWRITTEN
A univariate Gaussian Mixture Model (GMM) has two components, both of which have their own mean and standard deviation. The model is defined by the following parameters: $$ \mathbf{z} \sim Bernoulli(\theta) = \begin{cases} \theta &\text{if} \, z=0 \\ 1-\theta &\text{if} \, z=1 \\\end{cases}$$ $$ \mathbf{p(x|z=0)} \sim \mathcal{N}(\mu, 3\sigma^{2}) $$ $$ \mathbf{p(x|z=1)} \sim \mathcal{N}(2\mu, \sigma^{2}) $$
For a dataset of N datapoints, find the following:
2.1.1. Write the marginal probability of x, i.e. $p(x)$ [3pts]
-- Express your answers in terms of $\mathcal{N}(a, b)$ where $a$ represents the mean and $b$ represents the variance of a normal distribution, and $\theta$
-- HINT: Start with the Sum Rule
2.1.2. E-Step: Compute the posterior probability, i.e, $p(z_i=k|x_i)$, where k = {0,1} [3pts]
-- Express your answers in terms of $\mathcal{N}(a, b)$ where $a$ represents the mean and $b$ represents the variance of a normal distribution, and $\theta$
2.1.3. M-Step: Compute the updated value of $\sigma^{2}$ (You can keep $\mu$ fixed for this) [9pts]
-- Express your answers in terms of $\tau$, $x$ (when you expand the $\mathcal{N}(a,b)$ terms), and $\mu$
-- HINT: Start from this equation and substitute $\theta$ with $\sigma^2$: $\theta_{new}=argmax_{\theta}\sum_{Z}p(Z|X,\theta_{old})\ln p(X,Z|\theta)$
$p(x,z) = p(x|z)p(z)$ \ $p(x,z) = \begin{cases} \theta\mathcal{N}(\mu, 3\sigma^{2}) &\text{if} \, z=0 \\ (1-\theta)\mathcal{N}(2\mu, \sigma^{2}) &\text{if} \, z=1 \\\end{cases} $ \ $p(x) = \sum_z p(x,z)$ \ $p(x) = \theta\mathcal{N}(\mu, 3\sigma^{2}) + (1-\theta)\mathcal{N}(2\mu, \sigma^{2})$
$p(z_i=k|x_j) = \tau_{i,j} = \frac{p(z_i)p(x_j|z_i)}{p(x_j)}$ by Bayes' Rule \ $\tau_{i,j} = \frac{Bernoulli(\theta) p(x_j|z_i)}{\theta\mathcal{N}(\mu, 3\sigma^{2}) + (1-\theta)\mathcal{N}(2\mu, \sigma^{2})}$ \ $\tau_{0,j} = \frac{\theta \mathcal{N}(\mu, 3\sigma^{2})}{\theta\mathcal{N}(\mu, 3\sigma^{2}) + (1-\theta)\mathcal{N}(2\mu, \sigma^{2})}$ \ $\tau_{1,j} = \frac{(1-\theta)\mathcal{N}(2\mu, \sigma^{2})}{\theta\mathcal{N}(\mu, 3\sigma^{2}) + (1-\theta)\mathcal{N}(2\mu, \sigma^{2})}$
$\sigma^2_{new} = argmax_{\sigma^2} \sum_Z p(Z|X,\sigma^2_{old}) \ln p(X,Z|\theta)$ \ $\sigma^2_{new} = argmax_{\sigma^2} \sum_i \tau_{i,j,\sigma^2_{old}} \ln p(X,z_i|\sigma^2)$ \ $\sigma^2_{new} = argmax_{\sigma^2} \frac{\theta \mathcal{N}(\mu, 3\sigma^{2})}{\theta\mathcal{N}(\mu, 3\sigma^{2}) + (1-\theta)\mathcal{N}(2\mu, \sigma^{2})}\ln[\theta\mathcal{N}(\mu, 3\sigma^{2})] + \frac{(1-\theta)\mathcal{N}(2\mu, \sigma^{2})}{\theta\mathcal{N}(\mu, 3\sigma^{2}) + (1-\theta)\mathcal{N}(2\mu, \sigma^{2})}\ln[(1-\theta)\mathcal{N}(2\mu, \sigma^{2})]$ we can ignore the denominator, since it's the same for each term in our maximization \ $\sigma^2_{new} = argmax_{\sigma^2} \theta \mathcal{N}(\mu, 3\sigma^{2}) \ln[\theta\mathcal{N}(\mu, 3\sigma^{2})] + (1-\theta)\mathcal{N}(2\mu, \sigma^{2}) \ln[(1-\theta)\mathcal{N}(2\mu, \sigma^{2})]$ we can break this into smaller terms using log multiplication property \ $\sigma^2_{new} = argmax_{\sigma^2} \theta \mathcal{N}(\mu, 3\sigma^{2}) (\ln[\theta] + \ln[\mathcal{N}(\mu, 3\sigma^{2})] + (1-\theta)\mathcal{N}(2\mu, \sigma^{2}) (\ln[(1-\theta)] + \ln[\mathcal{N}(2\mu, \sigma^{2})])$
We then take the derivative w.r.t. $\sigma^2$ and set to zero. I solved for the derivative but it looks really messy and expands into like 20 terms, so I decided to work on the bonus instead. $$
Please make sure to read the problem setup in detail. Many questions for this section may have already been answered in the description and hints and docstrings.
A Gaussian Mixture Model(GMM) is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian Distribution. In a nutshell, GMM is a soft clustering algorithm in a sense that each data point is assigned to a cluster with a probability. In order to do that, we need to convert our clustering problem into an inference problem.
Given $N$ samples $X = [x_1, x_2, \ldots, x_N]^T$, where $x_i \in \mathbb{R}^D$. Let $\pi$ be a K-dimensional probability density function and $(\mu_k; \Sigma_k)$ be the mean and covariance matrix of the $k^{th}$ Gaussian distribution in $\mathbb{R}^d$.
The GMM object implements EM algorithms for fitting the model and MLE for optimizing its parameters. It also has some particular hypothesis on how the data was generated:
Our goal is to find a $K$-dimension Gaussian distributions to model our data $X$. This can be done by learning the parameters $\pi, \mu$ and $\Sigma$ through likelihood function. Detailed derivation can be found in our slide of GMM. The log-likelihood function now becomes:
\begin{align} \text{ln } p(x_1, \dots, x_N | \pi, \mu, \Sigma) = \sum_{i=1}^N \text{ln } \big( \sum_{k=1}^{K} \pi(k) \mathcal{N}(x_i | \mu_k, \Sigma_k)\big) \end{align}From the lecture we know that MLEs for GMM all depend on each other and the responsibility $\tau$. Thus, we need to use an iterative algorithm (the EM algorithm) to find the estimate of parameters that maximize our likelihood function. All detailed derivations can be found in the lecture slide of GMM.
In this step, we need to calculate the responsibility $\tau$, which is the conditional probability that a data point belongs to a specific cluster $k$ if we are given the datapoint, i.e. $P(z_k|x)$. The formula for $\tau$ is given below:
$$ \tau\left(z_k\right)=\frac{\pi_{k} \cal{N}\left(x | \mu_{k}, \Sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j} \cal{N}\left(x | \mu_{j}, \Sigma_{j}\right)}, \quad \text{for } k = 1, \dots, K $$Note that each data point should have one probability for each component/cluster. For this homework, you will work with $\tau\left(z_k\right)$ which has a size of $N\times K$ and you should have all the responsibility values in one matrix. We use gamma as $\tau$ in this homework.
After we obtained the responsibility, we can find the update of parameters, which are given below:
\begin{align} \mu_k^{new} &= \dfrac{\sum_{n=1}^N \tau(z_k)x_n}{N_k} \\ \Sigma_k^{new} &= \dfrac{1}{N_k}\sum_{n=1}^N \tau (z_k)^T(x_n - \mu_k^{new})^T(x_n-\mu_k^{new}) \\ \pi_k^{new} &= \dfrac{N_k}{N} \end{align}where $N_k = \sum_{n=1}^N \tau(z_k)$. Note that the updated value for $\mu_k$ is used when updating $\Sigma_k$. The multiplication of $\tau (z_k)^T(x_n - \mu_k^{new})^T$ is element-wise so it will preserve the dimensions of $(x_n - \mu_k^{new})^T$.
Special Notes
Hints
DO NOT USE FOR LOOPS OVER N. No credit will be given for implementing the function with for or while loops that visit every datapoint. You can always find a way to avoid looping over the observation datapoints in our homework problem. If you have to loop over D or K, that is fine.
You can initiate $\pi(k)$ the same for each $k$, i.e. $\pi(k) = \frac{1}{K}, \forall k = 1, 2, \ldots, K$.
In part 3 you are asked to generate the model for pixel clustering of image. We will need to use a multivariate Gaussian because each image will hava $N$ pixels and $D=3$ features which corresponds to red, green, and blue color intensities. It means that each image is a $(N\times3)$ dataset matrix. In the following parts, remember $D=3$ in this problem.
To avoid using for loops in your code, we recommend you take a look at the concept Array Broadcasting in Numpy. Also, certain calculations that required different shapes of arrays can also be achieved by broadcasting.
Be careful of the dimensions of your parameters. Before you test anything on the autograder, please look at the instructions below on the shapes of the variables you need to output and how to format your return statement. Print the shape of an array by print(array.shape) could enhance the functionality of your code and help you debugging. Also notice that a numpy array in shape $(N,1)$ is NOT the same as that in shape $(N,)$ so be careful and consistent on what you are using. You can see the detailed explanation here. Difference between numpy.array shape (R, 1) and (R,)
To facilitate some of the operations in the GMM implementation, we would like you to implement the following three helper functions. In these functions, "logit" refers to an input array of size $(N, D)$ that reperesents the unnormalized scores, that are passed to the softmax( ) or logsumexp( ) function. Remember the goal of helper functions is to facilitate our calculation so DO NOT USE FOR LOOP ON N.
Given $logit \in \mathbb{R}^{N \times D}$, calculate $prob \in \mathbb{R}^{N \times D}$, where $prob_{i, j} = \frac{\exp(logit_{i, j})}{\sum_{d=1}^D exp(logit_{i, d})}$.
Notes:
Special Notes
Given $logit \in \mathbb{R}^{N \times D}$, calculate $s \in \mathbb{R}^N$, where $s_i = \log \big( \sum_{j=1}^D \exp(logit_{i, j}) \big)$. Again, pay attention to the numerical problem. You may face similar condition as in the softmax function. In this case, add the maximum for each row of $logit$ back for your functions
Special Notes
1) Why should we add the maximum for each row of $logit$ to logsumexp() function?
Hint: start with a simple example like $logit \in \mathbb{R}^{1 \times D}$
Answer:
We first subtract the maximum for each row of logit to guarantee numerical stability. For $s_i = \log \big( \sum_{j=1}^D \exp(logit_{i, j}) \big)$, we run into computational problems on the $\exp(logit_{i, j})$ operation for large logits. By subtracting the largestdim, we limit the domain of the $\exp(logit{i, j})$ function to $[-max logit dim,0]$ (since logits are non-negative), which normalizes the range to $[0,1]$. To help calculate the loss of log-likelihood, we then sum these values and take the log.
We avoid numerical problems by subtracting the maximum dimension, but we have to add the maximum back for each row to guarantee that the log-sum output is non-negative!
You should be able to write your own function based on the following formula, and you are NOT allowed to use outside resource packages other than those we provided.
(for undergrads only) normalPDF
Using the covariance matrix as a diagonal matrix with variances of the individual variables appearing on the main diagonal of the matrix and zeros everywhere else means that we assume the features are independent. In this case, the multivariate normal density function simplifies to the expression below: $$\mathcal{N}(x: \mu, \Sigma) = \prod_{i=1}^D \frac{1}{\sqrt{2\pi\sigma_i^2}}\exp{\left( -\frac{1}{2\sigma_i^2} (x_i-\mu_i)^2\right)}$$ where $\sigma^2_i$ is the variance for the $i^{th}$ feature, which is the diagonal element of the covariance matrix.
(for grads only) multinormalPDF
Given the dataset $X \in \mathbb{R}^{N \times D}$, the mean vector $\mu \in \mathbb{R}^{D}$ and covariance matrix $\Sigma \in \mathbb{R}^{D \times D}$ for a multivariate Gaussian distrubution, calculate the probability $p \in \mathbb{R}^{N}$ of each data. The PDF is given by $$\mathcal{N}(X: \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}}|\Sigma|^{-1/2}\exp{\left(-\frac{1}{2}(X-\mu)\Sigma^{-1}(X-\mu)^T\right)}$$ where $|\Sigma|$ is the determinant of the covariance matrix.
Hints:
If you encounter "LinAlgError", you can mitigate your number/array by summing a small value before taking the operation, e.g. np.linalg.inv($\Sigma_k$ + SIGMA_CONST). You can arrest and handle such error by using Try and Exception Block in Python.Please only add SIGMA_CONST when sigma_i is not invertible.
In the above calculation, you must avoid computing a $(N,N)$ matrix. Using the above equation for large N will crash your kernel and/or give you a memory error on Gradescope. Instead, you can do this same operation by calculating $(X-\mu)\Sigma^{-1}$, a $(N,D)$ matrix, transpose it to be a $(D,N)$ matrix and do an element-wise multiplication with $(X-\mu)^T$, which is also a $(D,N)$ matrix. Lastly, you will need to sum over the 0 axis to get a $(1,N)$ matrix before proceeding with the rest of the calculation. This uses the fact that doing an element-wise multiplication and summing over the 0 axis is the same as taking the diagonal of the $(N,N)$ matrix from the matrix multiplication.
In Numpy implementation for each individual $\mu$, you can either use a 2-D array with dimension $(1,D)$ for each Gaussian Distribution, or a 1-D array with length $D$. Same to other array parameters. Both ways should be acceptable but pay attention to the shape mismatch problem and be consistent all the time when you implement such arrays.
Please DO NOT use self.D in your implementation of multinormalPDF() .
Things to do in this problem:
Examples of how you can initialize the parameters.
The log-likelihood function is given by: $$ \begin{align} \ell(\theta) = \sum_{i=1}^N \text{ln } \big( \sum_{k=1}^{K} \pi(k) \mathcal{N}(x_i | \mu_k, \Sigma_k)\big) \end{align} $$ In this part, we will generate a $(N,K)$ matrix where each datapoint $x_i, \forall i = 1, \dots, N$ has $K$ log-likelihood numbers. Thus, for each $i = 1, \dots, N$ and $k = 1, \dots, K$, $$ \text{log-likelihood}[i,k] = \log{\pi_k}+\log{\cal{N}(x_i|\mu_k, \Sigma_k)} $$
Hints:
You can find the detail instruction in the above description box.
Hints:
Use these to test if your implementation of functions in GMM work as expected. See Using the Local Tests for more details.
###############################
### DO NOT CHANGE THIS CELL ###
###############################
from gmm import GMM
gmm_tester = localtests.GMMTests()
gmm_tester.test_helper_functions()
gmm_tester.test_init_components()
Your softmax works within the expected range: True Your logsumexp works within the expected range: True Your _init_component's pi works within expected range: True Your _init_component's mu works within expected range: True Your _init_component's sigma works within the expected range: True
gmm_tester.test_grad()
Your multinormal pdf works within the expected range: True Your lljoint works within the expected range: True Your E step works within the expected range: True Your M step works within the expected range: True
Images typically need a lot of bandwidth to be transmitted over the network. In order to optimize this process, most image processors perform lossy compression of images (lossy implies some information is lost in the process of compression).
In this section, you will use your GMM algorithm implementation to do pixel clustering and compress the images. That is to say, you would develop a lossy image compression algorithm. (Hint: you can adjust the number of clusters formed and justify your answer based on visual inspection of the resulting images or on a different metric of your choosing)
Special Notes
###############################
### DO NOT CHANGE THIS CELL ###
###############################
# helper function for performing pixel clustering.
def cluster_pixels_gmm(image, K, full_matrix = True):
"""Clusters pixels in the input image
Args:
image: input image of shape(H, W, 3)
K: number of components
Return:
clustered_img: image of shape(H, W, 3) after pixel clustering
"""
im_height, im_width, im_channel = image.shape
flat_img = np.reshape(image, [-1, im_channel]).astype(np.float32)
gamma, (pi, mu, sigma) = GMM(flat_img, K = K, max_iters = 10)(full_matrix)
cluster_ids = np.argmax(gamma, axis=1)
centers = mu
gmm_img = np.reshape(centers[cluster_ids], (im_height, im_width, im_channel))
return gmm_img
# helper function for plotting images. You don't have to modify it
def plot_images(img_list, title_list, figsize=(20, 10)):
assert len(img_list) == len(title_list)
fig, axes = plt.subplots(1, len(title_list), figsize=figsize)
for i, ax in enumerate(axes):
ax.imshow(img_list[i] / 255.0)
ax.set_title(title_list[i])
ax.axis('off')
# the direction of two images. Both of them are from ImageNet
img1_dir ='./data/images/gmm-example1.png'
img2_dir ='./data/images/gmm-example2.png'
# example of loading image
image1 = imageio.imread('./data/images/gmm-example1.png')
image2 = imageio.imread('./data/images/gmm-example2.png')
# this is for you to implement
def perform_compression(image, min_clusters=5, max_clusters=15, step_clusters=1):
"""
Using the helper function above to find the optimal number of clusters that can appropriately produce a single image.
You can simply examinate the answer based on your visual inspection (i.e. looking at the resulting images) or provide any metrics you prefer.
Args:
image: input image of shape(H, W, 3)
min_clusters, max_clusters: the minimum and maximum number of clusters you should test with. Default are 5a dn 15.
(Usually the maximum number of clusters would not exeed 15)
Return:
plot: comparison between original image and image pixel clustering.
optional: any other information/metric/plot you think is necessary.
"""
img_list = [image]
title_list = ['original']
for k in np.arange(min_clusters,max_clusters+1,step_clusters):
img_list.append(cluster_pixels_gmm(image,k))
title_list.append('GMM w/ k={}'.format(k))
plot_images(img_list,title_list)
perform_compression(image1,5,10,5)
perform_compression(image2,5,10,5)
C:\Users\jwald\AppData\Local\Temp\ipykernel_18744\3338280613.py:6: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning dissapear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.
image1 = imageio.imread('./data/images/gmm-example1.png')
C:\Users\jwald\AppData\Local\Temp\ipykernel_18744\3338280613.py:7: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning dissapear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.
image2 = imageio.imread('./data/images/gmm-example2.png')
iter 9, loss: 5153872.8677: 100%|██████████| 10/10 [00:10<00:00, 1.00s/it]
iter 9, loss: 5100484.7648: 100%|██████████| 10/10 [00:20<00:00, 2.09s/it]
iter 9, loss: 3452193.4731: 100%|██████████| 10/10 [00:10<00:00, 1.02s/it]
iter 9, loss: 2990942.2220: 100%|██████████| 10/10 [00:21<00:00, 2.13s/it]
Compare full covariance matrix with diagonal covariance matrix. Can you explain why the images are different with same clusters? Note: You will have to implement both multinormalPDF and normalPDF, and add a few arguments in the original _ll_joint(), _M_step(), _E_step() function. You will earn full credit only if you implement all functions AND provide an explanation.
###############################
### DO NOT CHANGE THIS CELL ###
###############################
def compare_matrix(image, K):
"""
Args:
image: input image of shape(H, W, 3)
K: number of components
Return:
plot: comparison between full covariance matrix and diagonal covariance matrix.
"""
#full covariance matrix
gmm_image_full = cluster_pixels_gmm(image, K, full_matrix = True)
#diagonal covariance matrix
gmm_image_diag = cluster_pixels_gmm(image, K, full_matrix = False)
plot_images([gmm_image_full, gmm_image_diag], ['full covariance matrix', 'diagonal covariance matrix'])
compare_matrix(image1, 5)
iter 9, loss: 5153872.8677: 100%|██████████| 10/10 [00:10<00:00, 1.09s/it] iter 9, loss: 5153872.8677: 100%|██████████| 10/10 [00:10<00:00, 1.04s/it]
In this question, you will be fitting your GMM implementation on a 2D Gaussian Mixture to estimate the parameters of the distributions that make up the mixture, and then using these estimated parameters to generate samples.
###############################
### DO NOT CHANGE THIS CELL ###
###############################
data = np.load('./data/mixture.npy')
print(data.shape)
plt.plot(data[:,0], data[:,1], 'x')
plt.axis('equal')
plt.title('2-D Gaussian Mixture')
plt.show()
(1000, 2)
Now, you need to estimate the parameters of the Gaussian Mixture, and then use these estimated parameters to generate 1000 samples from the Gaussian Mixture. Plot the sampled datapoints. You should notice that it resembles the original Gaussian Mixture.
Steps
Note: You only need to implement two steps: (i) estimate the parameters of the Gaussian Mixture, and (ii) Implement the density function. The steps for Rejection Sampling have been provided.
# TODO: fit your GMM implementation to the dataset
gmm = GMM(data,3,10)
gamma, (pi, mu, sigma) = gmm()
# print the estimated parameters
print(pi, '\n')
print(mu, '\n')
print(sigma)
iter 9, loss: 4425.7699: 100%|██████████| 10/10 [00:00<00:00, 118.57it/s]
[0.49744706 0.20012216 0.30243078] [[ 7.93180484 4.07132458] [-0.03308011 0.11406075] [ 0.15710167 8.15468582]] [[[ 3.85580941 -1.83305898] [-1.83305898 1.84962937]] [[ 1.04478395 -0.01529004] [-0.01529004 0.896644 ]] [[ 2.75188813 3.70660423] [ 3.70660423 6.76447589]]]
###############################
### DO NOT CHANGE THIS CELL ###
###############################
# Extract x and y
x = data[:, 0]
y = data[:, 1]
# Define the borders of the grid
deltaX = (max(x) - min(x))/10
deltaY = (max(y) - min(y))/10
xmin = min(x) - deltaX
xmax = max(x) + deltaX
ymin = min(y) - deltaY
ymax = max(y) + deltaY
# Create meshgrid
xx, yy = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
# coordinates of the points that make the grid
positions = np.vstack([xx.ravel(), yy.ravel()]).T
def density(points, pi, mu, sigma, gmm):
"""Evaluate the density at each point on the grid.
Args:
points: (N, 2) numpy array containing the coordinates of the points that make up the grid.
pi: (K,) numpy array containing the mixture coefficients for each class
mu: (K, D) numpy array containing the means of each cluster
sigma: (K, D, D) numpy array containing the covariance matrixes of each cluster
gmm: an instance of the GMM model
Return:
densities: (N, ) numpy array containing densities at each point on the grid
HINT: You should be using the formula given in the hints.
"""
# TODO: Implement this function
densities = np.sum([pi[k]*gmm.multinormalPDF(points,mu[k,:],sigma[k,:,:]) for k in range(len(pi))],axis=0)
return densities
# get the density at each coordinate on the grid
densities = np.reshape(density(positions, pi, mu, sigma, gmm), xx.shape)
###############################
### DO NOT CHANGE THIS CELL ###
###############################
fig = plt.figure(figsize=(13, 7))
ax = plt.axes(projection='3d')
surf = ax.plot_surface(xx, yy, densities, rstride=1, cstride=1, cmap='coolwarm', edgecolor='none')
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('PDF')
ax.set_title('Surface plot of 2D Gaussian Mixture Densities')
fig.colorbar(surf, shrink=0.5, aspect=5) # add color bar indicating the PDF
ax.view_init(60, 35)
###############################
### DO NOT CHANGE THIS CELL ###
###############################
fig = plt.figure(figsize=(13, 7))
ax = plt.axes(projection='3d')
w = ax.plot_wireframe(xx, yy, densities)
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('PDF')
ax.set_title('Wireframe plot of 2D Gaussian Mixture')
Text(0.5, 0.92, 'Wireframe plot of 2D Gaussian Mixture')
###############################
### DO NOT CHANGE THIS CELL ###
###############################
def sample(xmin, xmax, ymin, ymax, gmm, dmax=1, M=0.1):
"""Performs rejection sampling. Keep sampling datapoints until d <= f(x, y) / M
Args:
xmin: lower bound on x values
xmax: upper bound on x values
ymin: lower bound on y values
ymax: upper bound on y values
gmm: an instance of the GMM model
dmax: the upper bound on d
M: scale_factor. can be used to control the fraction of samples that are rejected
Return:
x, y: the coordinates of the sampled datapoint
HINT: Refer to the links in the hints
"""
while True:
x = np.random.uniform(low=xmin, high=xmax)
y = np.random.uniform(low=ymin, high=ymax)
d = np.random.uniform(low=0, high=dmax)
if d < density(np.array([x,y]).reshape(1,2), pi, mu, sigma, gmm) / M:
return x, y
###############################
### DO NOT CHANGE THIS CELL ###
###############################
# Sample datapoints using Rejection Sampling
generated_datapoints = np.zeros((1000, 2))
i = 0
while i < 1000:
generated_datapoints[i,0], generated_datapoints[i,1] = sample(xmin, xmax, ymin, ymax, gmm, dmax=1)
if i % 100 == 0:
print(i)
i += 1
0 100 200 300 400 500 600 700 800 900
###############################
### DO NOT CHANGE THIS CELL ###
###############################
plt.scatter(generated_datapoints[:,0], generated_datapoints[:,1])
plt.axis('equal')
plt.title('Sampled Datapoints')
plt.show()
Learning to work with messy data is a hallmark of a well-rounded data scientist. In most real-world settings the data given will usually have some issue, so it is important to learn skills to work around such impasses. This part of the assignment looks to expose you to clever ways to fix data using concepts that you have already learned in the prior questions.
After graduating from Georgia Tech with your shiny new degree, you are recruited to help with safety testing for the Mars rocket at NASA. Of course NASA won't be sending rocket after rocket to stress test your fellow employees' engineering (they also graduated from Tech, so you have full confidence in them), so instead, NASA has decided to run numerous simulations on the current engineering design of the Mars rocket. The simulation collects shuttle data from its sensors, resulting in 8 features which include bypass, rad flow, etc. These features are contained within the first through eighth columns. The ninth column shows the label with 1 being a successful simulation and 0 being an unsuccessful simulation.
However, due to an intern accidentally deleting random data points, 20% of the entries are missing labels and 30% are missing characterization data. Since simply removing the corrupted entries would not reflect the true variance of the data, your job is to implement a solution to clean the data so it can be properly classified.
Your job is to assist NASA in cleaning the data and implementing a semi-supervised learning framework to help them create a general classifier for future simulations.
You are given two files for this task:
The first step is to break up the whole dataset into clear parts. All the data is randomly shuffled in one csv file. In order to move forward, the data needs to be split into three separate arrays:
In semisupervised.py, implement the following methods:
###############################
### DO NOT CHANGE THIS CELL ###
###############################
localtests.SemisupervisedTests().test_data_separating_methods()
The second step in this task is to clean the Labeled_incomplete dataset by filling in the missing values with probable ones derived from complete data. A useful approach to this type of problem is using a k-nearest neighbors (k-NN) algorithm. For this application, the method consists of replacing the missing value of a given point with the mean of the closest k-neighbors to that point.
In the CleanData class in semisupervised.py, implement the following methods:
The unit test is a good expectation of what the process should look like on a toy dataset. If your output matches the answer, you are on the right track. Run the following cell to check.
NOTE: Your rows of data should match with the expected output, although the order of the rows does not necessarily matter.
###############################
### DO NOT CHANGE THIS CELL ###
###############################
localtests.SemisupervisedTests().test_cleandata()
Another method of filling the missing values is by using the mean of individual features. The mean of all non-NaN values of a feature could be used to replace any NaN value belonging to the feature. Implement the mean_clean_data method in accordance with this rule. NOTE: There should be no NaN values in the n*d array that you return from mean_clean_data.
In the CleanData class in semisupervised.py, implement the following method:
###############################
### DO NOT CHANGE THIS CELL ###
###############################
localtests.SemisupervisedTests().test_mean_clean_data()
You will implement a version of the algorithm presented in Table 1 of the paper "Text Classification from Labeled and Unlabeled Documents using EM" by Nigam et al. (2000). While you are recommended to read the whole paper this assignment focuses on items 5.2 and 6.1. Write a brief summary of three interesting highlights of the paper (50-words maximum).
In your implementation of the EM algorithm proposed by Nigam et al. (2000) on Table 1, you will use a Gaussian Naive Bayes (GNB) classifier as opposed to a naive Bayes (NB) classifier. (Hint: Using a GNB in place of an NB will enable you to reuse most of the implementation you developed for GMM in this assignment. In fact, you can successfully solve the problem by simply modifying the call and _init_components methods.)
In the SemiSupervised class in semisupervised.py, implement the following methods:
Compare the classification error based on the Gaussian Naive Bayes (GNB) classifier you implemented following the Nigam et al. (2000) approach to the performance of a GNB classifier trained using only labeled data. Since you have not covered supervised learning in class, you are allowed to use the scikit learn library for training the GNB classifier based only on labeled data: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html.
In the ComparePerformance class in semisupervised.py, implement the following method:
To acheive the full 5 points you must implement the ComparePerformance.accuracy_semi_supervised and ComparePerformance.accuracy_GNB methods and get these scores:
###############################
### DO NOT CHANGE THIS CELL ###
###############################
from semisupervised import complete_
from semisupervised import incomplete_
from semisupervised import unlabeled_
from semisupervised import mean_clean_data
from semisupervised import CleanData
from semisupervised import ComparePerformance
###############################
### DO NOT CHANGE THIS CELL ###
###############################
# Load training data
all_data = np.loadtxt('data/data.csv', delimiter=',')
# Separate training data into categories: labeled complete, labeled incomplete, and unlabeled points
labeled_complete = complete_(all_data)
labeled_incomplete = incomplete_(all_data)
unlabeled = unlabeled_(all_data)
# Perform data cleaning on labeled incomplete data
cleaned_data = CleanData()(labeled_incomplete, labeled_complete, 10)
# Combine cleaned data with unlabeled data
cleaned_and_unlabeled = np.concatenate((cleaned_data, unlabeled), 0)
# Category for data that is guaranteed to have label values
labeled_data = np.concatenate((labeled_complete, labeled_incomplete), 0)
# Perform mean data cleaning on all labeled data
mean_cleaned_data = mean_clean_data(labeled_data)
# Print data shapes
print(f"All Data shape: {all_data.shape}")
print(f"Labeled Complete shape: {labeled_complete.shape}")
print(f"Labeled Incomplete shape: {labeled_incomplete.shape}")
print(f"Labeled shape: {labeled_data.shape}")
print(f"Unlabeled shape: {unlabeled.shape}")
print(f"Cleaned data shape: {cleaned_data.shape}")
print(f"Cleaned + Unlabeled data shape: {cleaned_and_unlabeled.shape}")
# load validation data
validation = np.loadtxt('data/validation.csv', delimiter=',')
# =========================================================================
# SUPERVISED GNB WITH ONLY THE COMPLETE DATA (SKLEARN)
accuracy_complete_data_only = ComparePerformance.accuracy_GNB(labeled_complete, validation)
# =========================================================================
# SUPERVISED GNB WITH CLEAN DATA (SKLEARN)
accuracy_cleaned_data = ComparePerformance.accuracy_GNB(cleaned_data, validation)
# =========================================================================
# SUPERVISED GNB WITH MEAN CLEAN DATA (SKLEARN)
accuracy_mean_cleaned_data = ComparePerformance.accuracy_GNB(mean_cleaned_data, validation)
# =========================================================================
# SEMI SUPERVISED GNB WITH ALL DATA (your implementation)
accuracy_semi_supervised = ComparePerformance.accuracy_semi_supervised(cleaned_and_unlabeled, validation, 2)
# ==========================================================================
# COMPARISON
print("""===COMPARISON===""")
print(f"Supervised with only complete data, GNB Accuracy: {np.round(100.0 * accuracy_complete_data_only, 3)}%")
print(f"Supervised with KNN clean data, GNB Accuracy: {np.round(100.0 * accuracy_cleaned_data, 3)}%")
print(f"Supervised with Mean clean data, GNB Accuracy: {np.round(100.0 * accuracy_mean_cleaned_data, 3)}%")
print(f"SemiSupervised Accuracy: {np.round(100.0 * accuracy_semi_supervised, 3)}%")
What are the differences in using the kNN method and the mean method to fill NaN values? Explain in terms of the results you get from each.
Answer